102 research outputs found
A non-projective greedy dependency parser with bidirectional LSTMs
The LyS-FASTPARSE team presents BIST-COVINGTON, a neural implementation of
the Covington (2001) algorithm for non-projective dependency parsing. The
bidirectional LSTM approach of Kiperwasser and Goldberg (2016) is used to
train a greedy parser with a dynamic oracle to mitigate error propagation. The
model participated in the CoNLL 2017 UD Shared Task. In spite of not using any
ensemble methods and using the baseline segmentation and PoS tagging, the
parser obtained good results on both macro-average LAS and UAS in the big
treebanks category (55 languages), ranking 7th out of 33 teams. In the all
treebanks category (LAS and UAS) we ranked 16th and 12th. The gap between the
all and big categories is mainly due to poor performance on the four parallel
PUD treebanks, suggesting that some 'suffixed' treebanks (e.g. Spanish-AnCora)
perform poorly in cross-treebank settings, which does not occur with the
corresponding 'unsuffixed' treebanks (e.g. Spanish). Correcting for this, we
obtain the 11th best LAS among all runs (official and unofficial). The code is made
available at https://github.com/CoNLL-UD-2017/LyS-FASTPARSE. Comment: 12 pages, 2 figures, 5 tables.
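The Covington (2001) strategy compares each new word against the words read so far, which is what makes non-projective (crossing) arcs reachable. A minimal sketch, where `score(head, dep)` is a hypothetical stand-in for the paper's BiLSTM arc scorer:

```python
def covington_parse(words, score):
    """Greedy parser in the spirit of Covington (2001): each new word j
    is compared against every earlier word i, adding an arc when the
    scorer approves and the dependent has no head yet. `score(head, dep)`
    is a hypothetical stand-in for the BiLSTM-based scorer; a full
    implementation would also enforce acyclicity and, as in the paper,
    train with a dynamic oracle. words[0] is an artificial ROOT."""
    head = {}  # dependent index -> head index
    for j in range(1, len(words)):
        for i in range(j - 1, -1, -1):  # walk leftwards, as in Covington
            if j not in head and score(words[i], words[j]) > 0:
                head[j] = i            # right arc i -> j
            elif i != 0 and i not in head and score(words[j], words[i]) > 0:
                head[i] = j            # left arc j -> i
    return head
```

Because every (i, j) pair is considered, crossing arcs can be produced, unlike in strictly projective stack-based transition systems.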
Better, Faster, Stronger Sequence Tagging Constituent Parsers
Sequence tagging models for constituent parsing are faster, but less accurate
than other types of parsers. In this work, we address the following weaknesses
of such constituent parsers: (a) high error rates around closing brackets of
long constituents, (b) large label sets, leading to sparsity, and (c) error
propagation arising from greedy decoding. To effectively close brackets, we
train a model that learns to switch between tagging schemes. To reduce
sparsity, we decompose the label set and use multi-task learning to jointly
learn to predict sublabels. Finally, we mitigate issues from greedy decoding
through auxiliary losses and sentence-level fine-tuning with policy gradient.
Combining these techniques, we clearly surpass the performance of sequence
tagging constituent parsers on the English and Chinese Penn Treebanks, and
reduce their parsing time even further. On the SPMRL datasets, we observe even
greater improvements across the board, including a new state of the art on
Basque, Hebrew, Polish and Swedish. Comment: NAACL 2019 (long papers). Contains corrigendum.
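The label-decomposition idea can be sketched as follows. The tag format (a relative level plus a nonterminal, e.g. '2_NP') is an illustrative assumption; the paper's exact encoding may differ:

```python
def decompose(label):
    """Split a compound sequence-tagging label such as '2_NP' into its
    sublabels: a (relative) level and a nonterminal. A multi-task model
    predicts each part with a separate output head, so the sparse joint
    label space never has to be enumerated."""
    level, _, nonterminal = label.partition("_")
    return int(level), nonterminal

# the joint label set grows multiplicatively; the sublabel sets do not
labels = ["1_S", "2_NP", "-1_VP", "2_VP", "1_NP"]
levels = {decompose(l)[0] for l in labels}        # levels seen: {-1, 1, 2}
nonterminals = {decompose(l)[1] for l in labels}  # nonterminals: {'S', 'NP', 'VP'}
```

Each output head is trained over its own small sublabel set, which is the sparsity reduction the abstract refers to.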
Misspelled queries in cross-language IR: analysis and management
This paper studies the impact of misspelled queries on the performance of Cross-Language Information Retrieval systems and proposes two strategies for dealing with them: the use of automatic spelling correction techniques, and the use of character n-grams both as index terms and translation units, thus allowing us to take advantage of their inherent robustness. Our results demonstrate the sensitivity of these systems to such errors and the effectiveness of the proposed solutions. To the best of our knowledge there is no similar work in the cross-language field. Work partially funded by the Ministerio de Economía y Competitividad and FEDER (projects TIN2010-18552-C03-01 and TIN2010-18552-C03-02) and by the Xunta de Galicia (grants CN 2012/008, CN 2012/317 and CN 2012/319).
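The robustness of character n-grams comes from the fact that a word and its misspelling still share most of their n-grams. A small sketch; the '_' padding and the Jaccard overlap measure are illustrative choices, not the paper's exact formulation:

```python
def char_ngrams(word, n=4):
    """Return the set of character n-grams of a word, padded with '_'
    so that prefixes and suffixes are represented as well."""
    padded = f"_{word}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def overlap(a, b, n=4):
    """Jaccard overlap between the n-gram sets of two strings: a
    misspelled query term still overlaps strongly with the intended
    index term, so retrieval degrades gracefully under typos."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)
```

For example, `overlap("retrieval", "retreival")` is far higher than the overlap with an unrelated term, and the same n-gram matching can act as a translation unit between lexically related languages.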
One model, two languages: training bilingual parsers with harmonized treebanks
We introduce an approach to train lexicalized parsers using bilingual corpora
obtained by merging harmonized treebanks of different languages, producing
parsers that can analyze sentences in either of the learned languages, or even
sentences that mix both. We test the approach on the Universal Dependency
Treebanks, training with MaltParser and MaltOptimizer. The results show that
these bilingual parsers are more than competitive: most combinations preserve
accuracy, and some even achieve significant improvements over the
corresponding monolingual parsers. Preliminary experiments also show the
approach to be promising on texts with code-switching and when more languages
are added. Comment: 7 pages, 4 tables, 1 figure.
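A minimal sketch of the bilingual training setup, assuming harmonized annotation (a shared universal POS and dependency-relation inventory, as in the Universal Dependency Treebanks) and a hypothetical sentence layout of (form, upos, head, deprel) tuples:

```python
import random

def merge_treebanks(tb_a, tb_b, seed=0):
    """Concatenate two harmonized treebanks and shuffle the sentences,
    so a single parser model is trained on both languages at once.
    A warning flags relation labels that only one treebank uses, since
    harmonization is what makes the merged corpus consistent. The
    sentence layout (form, upos, head, deprel) is an assumption."""
    rels = lambda tb: {tok[3] for sent in tb for tok in sent}
    unshared = rels(tb_a) ^ rels(tb_b)
    if unshared:
        print(f"warning: relation labels not shared by both treebanks: {unshared}")
    merged = list(tb_a) + list(tb_b)
    random.Random(seed).shuffle(merged)  # fixed seed for reproducibility
    return merged
```

A parser such as MaltParser can then be trained on the merged corpus exactly as on a monolingual one; code-switched input simply mixes lexical material the single model has already seen.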